Course: Visual Analytics for Policy and Management

Prof. José Manuel Magallanes, PhD


Part 2: Visualizing Tabular data

Bivariate Case


Contents:

  1. Intro.

  2. Categorical-Categorical case.

  3. Categorical-Numerical case.

  4. Numerical-Numerical case.

Exercises:


We analyze two variables to find out if there might be some kind of association between them. Even though such an association may be difficult to identify clearly, bivariate analysis still helps reveal signs of it that can at least raise questions worth exploring.

As before, the nature of the data allows for some particular analytical techniques, while also providing some limitations to our inferences. Let’s see what we can visualize with the different combinations of data.

This time, I will use crime data from the Seattle Open Data portal:

link="https://github.com/EvansDataScience/data/raw/master/crime.RData"
load(file = url(link))

The data available are:

names(crime)
##  [1] "Report.Number"               "Occurred.Date"              
##  [3] "year"                        "month"                      
##  [5] "weekday"                     "Occurred.Time"              
##  [7] "Occurred.DayTime"            "Reported.Date"              
##  [9] "Reported.Time"               "DaysToReport"               
## [11] "crimecat"                    "Crime.Subcategory"          
## [13] "Primary.Offense.Description" "Precinct"                   
## [15] "Sector"                      "Beat"                       
## [17] "Neighborhood"

A quick look will give us:

head(crime)
##    Report.Number Occurred.Date year month weekday Occurred.Time
## 1 20130000244104    2013-07-09 2013     7 Tuesday          1930
## 2 20130000242916    2013-07-09 2013     7 Tuesday          1917
## 3 20130000904541    2013-07-09 2013     7 Tuesday          1900
## 4 20130000243875    2013-07-09 2013     7 Tuesday          1900
## 5 20130000242883    2013-07-09 2013     7 Tuesday          1846
## 6 20130000242880    2013-07-09 2013     7 Tuesday          1844
##   Occurred.DayTime Reported.Date Reported.Time DaysToReport
## 1          evening    2013-07-10          1722            1
## 2          evening    2013-07-09          2052            0
## 3          evening    2013-07-10            35            1
## 4          evening    2013-07-10          1258            1
## 5          evening    2013-07-09          1846            0
## 6          evening    2013-07-09          1849            0
##                    crimecat         Crime.Subcategory
## 1                  NARCOTIC                  NARCOTIC
## 2                  BURGLARY      BURGLARY-RESIDENTIAL
## 3                 CAR PROWL                 CAR PROWL
## 4                     THEFT       MOTOR VEHICLE THEFT
## 5 FAMILY OFFENSE-NONVIOLENT FAMILY OFFENSE-NONVIOLENT
## 6                  BURGLARY      BURGLARY-RESIDENTIAL
##   Primary.Offense.Description Precinct Sector Beat      Neighborhood
## 1     NARC-FRAUD-PRESCRIPTION    NORTH      U   U3         SANDPOINT
## 2        BURGLARY-NOFORCE-RES    NORTH      L   L3          LAKECITY
## 3              THEFT-CARPROWL    SOUTH      R   R2       MOUNT BAKER
## 4              VEH-THEFT-AUTO    NORTH      U   U3 ROOSEVELT/RAVENNA
## 5                 CHILD-OTHER     WEST      Q   Q2        QUEEN ANNE
## 6          BURGLARY-FORCE-RES     WEST      Q   Q2        QUEEN ANNE

Let’s see what kind of data we have:

str(crime,width = 70,strict.width='cut')
## 'data.frame':    499698 obs. of  17 variables:
##  $ Report.Number              : chr  "20130000244104" "201300002429"..
##  $ Occurred.Date              : Date, format: "2013-07-09" ...
##  $ year                       : num  2013 2013 2013 2013 2013 ...
##  $ month                      : num  7 7 7 7 7 7 7 7 7 7 ...
##  $ weekday                    : Ord.factor w/ 7 levels "Monday"<"Tu"..
##  $ Occurred.Time              : num  1930 1917 1900 1900 1846 ...
##  $ Occurred.DayTime           : Ord.factor w/ 4 levels "day"<"after"..
##  $ Reported.Date              : Date, format: "2013-07-10" ...
##  $ Reported.Time              : num  1722 2052 35 1258 1846 ...
##  $ DaysToReport               : num  1 0 1 1 0 0 0 0 1 0 ...
##  $ crimecat                   : Factor w/ 20 levels "AGGRAVATED ASS"..
##  $ Crime.Subcategory          : Factor w/ 30 levels "AGGRAVATED ASS"..
##  $ Primary.Offense.Description: Factor w/ 144 levels "ADULT-VULNERA"..
##  $ Precinct                   : Factor w/ 5 levels "EAST","NORTH",....
##  $ Sector                     : Factor w/ 23 levels "6804","9512",....
##  $ Beat                       : Factor w/ 64 levels "B1","B2","B3",...
##  $ Neighborhood               : Factor w/ 58 levels "ALASKA JUNCTIO"..

Go to table of contents.

Categorical-Categorical relationships

The main way to organize these relationships is the contingency table. Let’s select a couple of categorical variables:

(CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime))
##                            
##                               day afternoon evening night
##   AGGRAVATED ASSAULT         3564      5366    4884  7501
##   ARSON                       196       167     191   486
##   BURGLARY                  24139     22288   14121 16082
##   CAR PROWL                 26740     38273   42595 34839
##   DISORDERLY CONDUCT           41        81      67    79
##   DUI                         706       939    2038  8522
##   FAMILY OFFENSE-NONVIOLENT  1748      2516    1217  1120
##   GAMBLE                        4         4       7     2
##   HOMICIDE                     41        46      49   131
##   LIQUOR LAW VIOLATION        112       491     410   606
##   LOITERING                    20        31      25     9
##   NARCOTIC                   2415      6416    3924  4109
##   PORNOGRAPHY                  65        53      17    31
##   PROSTITUTION                115       675    1425  1340
##   RAPE                        332       318     354   854
##   ROBBERY                    2584      4737    4139  5372
##   SEX OFFENSE-OTHER          1501      1759    1014  1776
##   THEFT                     38687     64868   38980 28410
##   TRESPASS                   4848      5184    2598  3289
##   WEAPON                      735      1445     947  1624

The table above shows counts, but in most situations, it is important to see relative values:

# using "pipes" to help readability:
library(magrittr)
(CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime)%>% #create table and then...
        prop.table() %>% #compute proportion and then...
        "*"(100)%>% # multiply by 100 and then...
        round(2) #...round to two decimals
        )
##                            
##                               day afternoon evening night
##   AGGRAVATED ASSAULT         0.71      1.07    0.98  1.50
##   ARSON                      0.04      0.03    0.04  0.10
##   BURGLARY                   4.83      4.46    2.83  3.22
##   CAR PROWL                  5.35      7.66    8.53  6.98
##   DISORDERLY CONDUCT         0.01      0.02    0.01  0.02
##   DUI                        0.14      0.19    0.41  1.71
##   FAMILY OFFENSE-NONVIOLENT  0.35      0.50    0.24  0.22
##   GAMBLE                     0.00      0.00    0.00  0.00
##   HOMICIDE                   0.01      0.01    0.01  0.03
##   LIQUOR LAW VIOLATION       0.02      0.10    0.08  0.12
##   LOITERING                  0.00      0.01    0.01  0.00
##   NARCOTIC                   0.48      1.28    0.79  0.82
##   PORNOGRAPHY                0.01      0.01    0.00  0.01
##   PROSTITUTION               0.02      0.14    0.29  0.27
##   RAPE                       0.07      0.06    0.07  0.17
##   ROBBERY                    0.52      0.95    0.83  1.08
##   SEX OFFENSE-OTHER          0.30      0.35    0.20  0.36
##   THEFT                      7.75     12.99    7.80  5.69
##   TRESPASS                   0.97      1.04    0.52  0.66
##   WEAPON                     0.15      0.29    0.19  0.33

Those tables show total counts or percentages. However, when a table is meant to test a hypothesized relationship, place the independent variable in the columns and the dependent one in the rows; then compute percentages by column, to see how the levels of the dependent variable vary within each level of the independent one, and compare along the rows.

CrimeCol=table(crime$crimecat,crime$Occurred.DayTime)%>%
         prop.table(margin = 2)%>%   # 2 is % by column
         "*"(100)%>%
         round(3)

CrimeCol
##                            
##                                day afternoon evening  night
##   AGGRAVATED ASSAULT         3.282     3.447   4.104  6.456
##   ARSON                      0.180     0.107   0.161  0.418
##   BURGLARY                  22.229    14.319  11.866 13.842
##   CAR PROWL                 24.624    24.588  35.794 29.987
##   DISORDERLY CONDUCT         0.038     0.052   0.056  0.068
##   DUI                        0.650     0.603   1.713  7.335
##   FAMILY OFFENSE-NONVIOLENT  1.610     1.616   1.023  0.964
##   GAMBLE                     0.004     0.003   0.006  0.002
##   HOMICIDE                   0.038     0.030   0.041  0.113
##   LIQUOR LAW VIOLATION       0.103     0.315   0.345  0.522
##   LOITERING                  0.018     0.020   0.021  0.008
##   NARCOTIC                   2.224     4.122   3.297  3.537
##   PORNOGRAPHY                0.060     0.034   0.014  0.027
##   PROSTITUTION               0.106     0.434   1.197  1.153
##   RAPE                       0.306     0.204   0.297  0.735
##   ROBBERY                    2.380     3.043   3.478  4.624
##   SEX OFFENSE-OTHER          1.382     1.130   0.852  1.529
##   THEFT                     35.626    41.674  32.756 24.453
##   TRESPASS                   4.464     3.330   2.183  2.831
##   WEAPON                     0.677     0.928   0.796  1.398
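
The `margin` argument of `prop.table()` is what drives these conditional percentages. A minimal toy sketch (hypothetical numbers, not the crime data) of how the three options behave:

```r
# Hypothetical 2x2 table to illustrate prop.table() margins:
toy <- matrix(c(10, 30, 20, 40), nrow = 2,
              dimnames = list(dep = c("yes", "no"), ind = c("A", "B")))
prop.table(toy)             # each cell divided by the grand total (all cells sum to 1)
prop.table(toy, margin = 1) # each row sums to 1 (percent by row)
prop.table(toy, margin = 2) # each column sums to 1 (percent by column, as used above)
```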

Tables like these, even though they already summarize two columns, will not let you discover association patterns easily; the complexity of two variables requires plots. However, you must check the data format the plotting functions require, as most plots will use the contingency table as input (not the raw data).

As before, we can use the bar plot with the contingency table as input:

barplot(CrimeCol)

This plot will need a lot of work, so base R may not be a good strategy here.

However, alternative or more specialized plotting functions may require you to convert the table into a data frame of frequencies. Let me start from the proportions table based on the total:

df.T=as.data.frame(CrimeTotal) # table of proportion based on total
# YOU GET:
head(df.T)
##                 Var1 Var2 Freq
## 1 AGGRAVATED ASSAULT  day 0.71
## 2              ARSON  day 0.04
## 3           BURGLARY  day 4.83
## 4          CAR PROWL  day 5.35
## 5 DISORDERLY CONDUCT  day 0.01
## 6                DUI  day 0.14

We should rename the columns of the table above:

names(df.T)=c('Crime','Daytime','Percent') #renaming
head(df.T)
##                Crime Daytime Percent
## 1 AGGRAVATED ASSAULT     day    0.71
## 2              ARSON     day    0.04
## 3           BURGLARY     day    4.83
## 4          CAR PROWL     day    5.35
## 5 DISORDERLY CONDUCT     day    0.01
## 6                DUI     day    0.14

A first option you may have is to reproduce the table:

library(ggplot2)                           
base = ggplot(df.T, aes(Daytime,Crime)) 
# plot value as point, size by value of percent
tablePlot1 = base + geom_point(aes(size = Percent), colour = "gray") 
# add value of Percent as label, move it
tablePlot2 = tablePlot1 + geom_text(aes(label = Percent),
                                    nudge_x = 0.1,
                                    size=2)
tablePlot2

…some more work:

tablePlot3 = tablePlot2 + scale_size_continuous(range=c(0,10)) #change 10?
tablePlot4 = tablePlot3 + theme_minimal() # less ink
tablePlot4 + theme(legend.position="none") # no legend

The plot looks nice, but unless the differences are clear-cut, you may see more noise than information, which distracts and delays decision making. Keep in mind that lengths of bars are easier to compare than areas of circles. You need a bar plot, but with better tools:

base  = ggplot(df.T, aes(x = Crime, y = Percent ) ) 
bars1 = base + geom_bar( stat = "identity" ) + theme_minimal()
# bar per day time with 'facet'
bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) 
bars2 

…some more work:

# change the minimal theme
bars3 = bars2 + theme( axis.text.x = element_text(angle = 90,
                                                  hjust = 1,
                                                  size=3 ) )
bars3

And the conditional (column-percent) table can be plotted like this:

df.C=as.data.frame(CrimeCol)
colnames(df.C)=c('Crime','Daytime','Percent')
#####

base  = ggplot(df.C, aes(x = reorder(Crime, Percent), y = Percent ) ) 

bars1 = base + geom_segment(aes(y = 0, x = reorder(Crime, Percent), yend = Percent, xend = Crime), stat = 'identity', color = "black")

bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) + geom_point()

bars3 = bars2 + coord_flip() + theme(axis.text.y = element_text(size=6,angle = 0)) + labs(x = "Total Percentage Per Daytime", y = "Percentage", title = "Percentage of Crime Based on Daytime", caption = "Source: Seattle Open Data Portal")

bars4 = bars3 + theme(panel.background = element_rect(fill = 'gray97'), plot.title=(element_text(hjust=0.5)), plot.caption=element_text(vjust = 0.1), axis.title.x = element_text(vjust = -1), axis.title.y = element_text(vjust = 2, hjust = .5))

bars4

Since the type of crime is not ordinal, the bars above were reordered by percent.

Exercise 1:
Turn the bars into lollipop with the right components.

Once you have plotted two categorical variables this way, you may consider other plots:

# heatplot
base = ggplot(df.C, aes(x = Daytime, y = reorder(Crime, -Percent), fill = Percent)) #Add a (-) in front of percent
heat1 = base +  geom_tile() 
heat2 = heat1 +scale_fill_gradient(low = "yellow", 
                                   high = "purple")
heat2

Some work on the legend:

heat3 = heat2 + theme_classic() 

heat4 = heat3 + theme(axis.text.x = element_text(angle = 0, vjust = 0.6), 
                      legend.title = element_blank(), # no title for legend
                      legend.position="top", 
                      legend.direction="horizontal",
                      legend.key.width=unit(1, "cm"),
                      legend.key.height=unit(1, "cm")) 

heat4 + labs(y="Crime")

Exercise 2:
Change the heatplot to ascending order, where intensity goes from yellow to purple.

Go to table of contents.

Categorical-Numerical relationships

Similar to the previous case, categorical variables can be used to understand the behavior of numerical variables. Here, the curiosity and experience of the analyst are critical for mining the data to reveal insight, because numerical data have much wider value ranges than categorical data. The data used will be a good example of this.

In the previous data set we have a variable that records the number of days it took to report a crime:

summary(crime$DaysToReport)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##     0.00     0.00     0.00     7.65     1.00 36525.00        2

There are several good categorical variables that can be used to study the behavior of this one. Let’s use precinct:

tapply(crime$DaysToReport,crime$Precinct,mean)
##      EAST     NORTH     SOUTH SOUTHWEST      WEST 
##  7.216687  7.104350  7.219687 11.063208  4.897332

Above, you see the mean time (in days) it takes people to report a crime, per precinct. You can quickly form a plot in your mind just by reading those values, but the plot you imagine may be far from this one:

boxplot(DaysToReport~Precinct,data=crime)

The plot above does not give much insight; there is too much noise. In fact, a fuller summary tells us more to consider:

tapply(crime$DaysToReport,crime$Precinct,summary)
## $EAST
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00     7.22     1.00 36525.00 
## 
## $NORTH
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.000     0.000     0.000     7.104     1.000 14268.000 
## 
## $SOUTH
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00     7.22     1.00 11292.00 
## 
## $SOUTHWEST
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00    11.06     1.00 11992.00 
## 
## $WEST
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.000     0.000     0.000     4.897     1.000 16801.000

From the information above, you know that in each precinct, 75% of crimes are reported within a day. If we consider that situation the expected behavior, we can omit those cases:

boxplot(DaysToReport~Precinct,data=crime,
        subset = DaysToReport>1) #subsetting

Still, no structure appears. Let me try different thresholds while showing how to divide the plotting screen:

par(mfrow=c(2,2)) # 2 rows, and two columns, by row:

boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=7,
        main="One week or more")

boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=30,
        main="One 30-day month or more")

boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=180,
        main="180 days or more")

boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=365,
        main="One year or more")

Up to this point, you need to be planning a good story. The situation is different in each case, but let’s build our story from the crimes that take a year or longer to report.

First, let’s see how many cases we have per precinct:

crimeYear=crime[crime$DaysToReport>=365,]
tapply(crimeYear$DaysToReport,crimeYear$Precinct,length)
##      EAST     NORTH     SOUTH SOUTHWEST      WEST 
##       263       555       248       230       362

The year the crime occurred would be another variable to consider, to see if we can filter more cases:

tapply(crimeYear$DaysToReport,crimeYear$year,length)
## 1908 1964 1973 1974 1975 1976 1977 1978 1979 1980 1985 1986 1987 1989 1990 
##    1    1    1    1    1    1    1    1    1    2    2    1    1    1    2 
## 1991 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 
##    2    4    1    6    3    5   20    8   39   52   14   28   36   43   76 
## 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 
##  136  167  130  141  121  160  141  135  127  125   67

If we were to plot by year, several years before 2000 have too few cases (those with two or fewer may require a case study; some might even be data-entry errors). Let’s get rid of those old cases:

crimeY2000=crime[(crime$DaysToReport>=365) & (crime$year>=2000),]
tapply(crimeY2000$DaysToReport,crimeY2000$Precinct,length)
##      EAST     NORTH     SOUTH SOUTHWEST      WEST 
##       257       543       237       214       358

Now, we see a better boxplot:

boxplot(DaysToReport~Precinct,data=crimeY2000,
        main="One year or more (from 2000)")

It would certainly be better if the units were years:

crimeY2000$YearsToReport=crimeY2000$DaysToReport/365
boxplot(YearsToReport~Precinct,data=crimeY2000,
        main="One year or more (from 2000)")

This is a good data subset, but the distribution still looks pretty similar in every precinct. At this stage, you can try visualizing the outlier distributions, which start around the five-year mark:

boxplot(YearsToReport~Precinct, data=crimeY2000, subset = YearsToReport>=5,
        main="Five years or more (from 2000)")

If your story is about the similarity of distributions among precincts, you can start polishing the last plots. But if you need to show more variety, try another relevant categorical variable.

Let’s try weekday and Occurred.DayTime:

par(mfrow=c(2,1))

boxplot(YearsToReport~weekday,data=crimeY2000,
        main="One year or more BY WEEKDAY (from 2000)",las=2)

boxplot(YearsToReport~Occurred.DayTime,data=crimeY2000,
        main="One year or more BY TIME CRIME OCCURRED (from 2000)",las=2)

Not much variety is perceived, so let’s try crimecat and year:

par(mfrow=c(2,1))

boxplot(YearsToReport~year,data=crimeY2000,
        main="One year or more (from 2000)",las=2)

boxplot(YearsToReport~crimecat,data=crimeY2000,
        main="One year or more (from 2000)",las=2)

The plot of years to report by year seems interesting: there is a decreasing tendency, so we can spend some time improving it via ggplot:

# no missing:
crimeYearGG=crimeY2000[complete.cases(crimeY2000$YearsToReport),]

base = ggplot(crimeYearGG,aes(x=factor(year), y=YearsToReport)) 
box  = base + geom_boxplot()
box

It may be the case that your audience does not know how to read a boxplot: it is a great plot, but it encodes a lot of statistical information. We can go simpler and use lines connecting the easy-to-understand summary points of each boxplot:

base  = ggplot(crimeYearGG,aes(x=factor(year), y=YearsToReport))
mins = base + stat_summary(fun.y=min, # function for 'y' is min()
                           geom="line",
                           show.legend = T,size=1,
                           aes(group=1,col='Min'))
mins # just the min values

Let me add the max values:

minsMaxs= mins + stat_summary(fun.y=max,
                              geom="line",
                              linetype='dashed',
                              size=1,show.legend = F,
                              aes(group=1,col='Max'))

minsMaxs

Adding the median:

minsMaxsMd= minsMaxs + stat_summary(fun.y=median,
                                    geom="line",size=2,
                                    aes(group=1,col='Median'))
minsMaxsMd

Let’s take control of the colors by customizing the legend:

# Change color of lines:
all1=minsMaxsMd + scale_colour_manual(name="Trends",
                                      values=c("blue", "black","red"))
all1

Now, let’s complete the story by showing how the filtered-out data behaves, that is, the crimes that took less than a year to report, from 2000 on (we will not include earlier data):

# data preparation:

crimeWeek=crime[(crime$DaysToReport<365) & (crime$year>=2000),]

crimeWeek$WeeksToReport=crimeWeek$DaysToReport/7

crimeYearGG2=crimeWeek[complete.cases(crimeWeek$WeeksToReport) &complete.cases(crimeWeek$crimecat),]
#plotting it:
base = ggplot(crimeYearGG2,aes(x=factor(year), y=WeeksToReport)) 
mins = base + stat_summary(fun.y=min,size=1,
                           geom="line", linetype='dashed',show.legend = T,
                           aes(group=1,col='Min'))
minsMaxs= mins + stat_summary(fun.y=max,
                              geom="line",size=1,show.legend = F,
                              aes(group=1,col='Max'))
minsMaxsMd= minsMaxs + stat_summary(fun.y=median,
                                    geom="line",size=2,
                                    aes(group=1,col='Median'))
all2=minsMaxsMd + scale_colour_manual(name="Trends",
                                      values=c("blue", "black","red")
                                      )
all2 

We also found variability across types of crime, so we could build a story with it; first with years to report (crimes that took a year or longer to report, from 2000 on):

base= ggplot(crimeYearGG,
             aes(x = reorder(crimecat, YearsToReport, FUN = max), # reorder!
                 y=YearsToReport)) 
mins = base + stat_summary(fun.y=min,size=1,
                           geom="line", linetype='dashed',show.legend = T,
                           aes(group=1,col='Min'))
minsMaxs= mins + stat_summary(fun.y=max,
                              geom="line",size=1,show.legend = F,
                              aes(group=1,col='Max'))
minsMaxsMd= minsMaxs + stat_summary(fun.y=median, size=2,
                                    geom="line",
                                    aes(group=1,col='Median'))
all3=minsMaxsMd + scale_colour_manual(name="Trends",
                                      values=c("blue", "black","red"))
all3 + coord_flip()

Now, for crimes that took less than a year to report, from 2000 on:

base = ggplot(crimeYearGG2,
              aes(x = reorder(crimecat, WeeksToReport, FUN = max),
                  y=WeeksToReport)) 
mins = base + stat_summary(fun.y=min,size=1,
                           geom="line", linetype='dashed',show.legend = T,
                           aes(group=1,col='Min'))
minsMaxs= mins + stat_summary(fun.y=max,
                              geom="line",size=1,show.legend = F,
                              aes(group=1,col='Max'))
minsMaxsMd= minsMaxs + stat_summary(fun.y=median,size=2,
                                    geom="line",
                                    aes(group=1,col='Median'))
all3=minsMaxsMd + scale_colour_manual(name="Trends",
                                      values=c("blue", "black","red"))
all4 = all3 + theme(axis.title.y=element_text(vjust = 2.5), axis.title.x = element_text(vjust = -1),panel.background = element_rect(fill = 'gray95'), plot.title=(element_text(hjust=0.5)), plot.caption=element_text(vjust=-1.5)) + labs(title = "Weeks To Report Based on Crime", caption = "Source: City of Seattle Crime Data", x = "Type of Crime", y = "Reporting Time (Measured in Weeks)")
#all3+coord_flip()
all4 + coord_flip()

Exercise 3:
Complete the information needed in the previous plots.

It is very common in scientific texts to see the mean-difference test known as one-way ANOVA, which goes beyond describing, as we have done, and tests whether the mean of the numerical variable varies across the groups defined by the categorical variable.

#making a subset:
anovaData=crimeY2000[crimeY2000$YearsToReport>=5,]

#checking the mean per factor value:
tapply(anovaData$YearsToReport, anovaData$Precinct, mean,na.rm=T)
##      EAST     NORTH     SOUTH SOUTHWEST      WEST 
##  8.402359  8.717236  8.278557  8.790947  8.853487
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
group.CI(YearsToReport ~ Precinct, 
         data=anovaData, 
         ci = 0.95)
##    Precinct YearsToReport.upper YearsToReport.mean YearsToReport.lower
## 1      EAST            9.424130           8.402359            7.380588
## 2     NORTH            9.206928           8.717236            8.227544
## 3     SOUTH            8.951394           8.278557            7.605721
## 4 SOUTHWEST            9.604103           8.790947            7.977791
## 5      WEST            9.844426           8.853487            7.862548
anovaData=anovaData[complete.cases(anovaData),]

# introducing ggpubr
library(ggpubr)
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:plyr':
## 
##     mutate
ggline(data=anovaData,x = "Precinct", y = "YearsToReport",add = 'mean_ci',
       error.plot = "pointrange") + scale_y_continuous(breaks=seq(7,10,0.5))

# Compute the analysis of variance
res.aov <- aov(YearsToReport ~ Precinct, data = anovaData)

# Summary of the analysis
summary(res.aov)[[1]]$Pr[1]
## [1] 0.7958203
# non parametric
kruskal.test(YearsToReport ~ Precinct, data = anovaData)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  YearsToReport by Precinct
## Kruskal-Wallis chi-squared = 2.8431, df = 4, p-value = 0.5844

Go to table of contents.

Numerical-Numerical relationships

The study of bivariate relationships between numerical variables is known as correlation analysis. The data we have been using has few numerical columns, but I will produce two by aggregating the original data set by Neighborhood:

  • Aggregating days to report by neighborhood:
# 1. MEAN of days it takes to report a crime by neighborhood
daysByNeigh=tapply(crime$DaysToReport, crime$Neighborhood, mean,na.rm=T)

# you have:
head(daysByNeigh)
## ALASKA JUNCTION            ALKI   BALLARD NORTH   BALLARD SOUTH 
##        8.968437       14.463768        7.788203        5.262384 
##        BELLTOWN      BITTERLAKE 
##        5.012865        7.442241
  • Aggregating crimes by neighborhood
# 2. PROPORTION of crimes by neighborhood
crimesByNeigh=tapply(crime$crimecat, crime$Neighborhood, length)%>%      
                     prop.table()%>%
                     "*"(100)%>% 
                     round(2) 
head(crimesByNeigh)
## ALASKA JUNCTION            ALKI   BALLARD NORTH   BALLARD SOUTH 
##            1.35            0.49            2.13            2.96 
##        BELLTOWN      BITTERLAKE 
##            2.99            1.92
  • Converting to data frames: We will turn the result of tapply into a data frame, keeping the names as a column:
library(tibble)
as.data.frame(daysByNeigh)%>%rownames_to_column()
##                             rowname daysByNeigh
## 1                   ALASKA JUNCTION    8.968437
## 2                              ALKI   14.463768
## 3                     BALLARD NORTH    7.788203
## 4                     BALLARD SOUTH    5.262384
## 5                          BELLTOWN    5.012865
## 6                        BITTERLAKE    7.442241
## 7                   BRIGHTON/DUNLAP    8.533382
## 8                      CAPITOL HILL    4.936174
## 9          CENTRAL AREA/SQUIRE PARK   12.236743
## 10 CHINATOWN/INTERNATIONAL DISTRICT    3.269826
## 11          CLAREMONT/RAINIER VISTA    8.422852
## 12                    COLUMBIA CITY    7.620129
## 13              COMMERCIAL DUWAMISH   14.918301
## 14         COMMERCIAL HARBOR ISLAND    8.711656
## 15              DOWNTOWN COMMERCIAL    3.563182
## 16                  EASTLAKE - EAST    5.876214
## 17                  EASTLAKE - WEST    4.783784
## 18                    FAUNTLEROY SW    7.296050
## 19                       FIRST HILL    7.113924
## 20                          FREMONT    5.921374
## 21                          GENESEE    7.522171
## 22                       GEORGETOWN    5.054068
## 23                        GREENWOOD    5.589741
## 24                       HIGH POINT   17.031684
## 25                    HIGHLAND PARK    8.523060
## 26                     HILLMAN CITY   11.087692
## 27   JUDKINS PARK/NORTH BEACON HILL    8.485470
## 28                         LAKECITY   11.369918
## 29             LAKEWOOD/SEWARD PARK    7.532511
## 30                     MADISON PARK    6.662824
## 31                   MADRONA/LESCHI    7.781613
## 32                         MAGNOLIA    6.684218
## 33                  MID BEACON HILL    6.706631
## 34                      MILLER PARK    8.580793
## 35             MONTLAKE/PORTAGE BAY    6.749860
## 36                           MORGAN   12.205063
## 37                      MOUNT BAKER    5.378289
## 38                        NEW HOLLY    4.888386
## 39                    NORTH ADMIRAL   11.287831
## 40                NORTH BEACON HILL    7.155411
## 41                   NORTH DELRIDGE   11.480690
## 42                        NORTHGATE    6.537741
## 43                    PHINNEY RIDGE    6.598978
## 44                     PIGEON POINT   10.464837
## 45                   PIONEER SQUARE    4.211331
## 46                       QUEEN ANNE    7.578140
## 47                    RAINIER BEACH    6.885420
## 48                     RAINIER VIEW    9.779054
## 49                ROOSEVELT/RAVENNA    7.012361
## 50   ROXHILL/WESTWOOD/ARBOR HEIGHTS    9.711947
## 51                        SANDPOINT   10.295633
## 52                      SLU/CASCADE    5.187209
## 53                             SODO    5.818391
## 54                SOUTH BEACON HILL   12.132686
## 55                   SOUTH DELRIDGE   10.480810
## 56                       SOUTH PARK   14.166928
## 57                       UNIVERSITY    5.509077
## 58                      WALLINGFORD    8.033284

Knowing how it works, we can create two data frames:

daysByNeigh=as.data.frame(daysByNeigh)%>%rownames_to_column()
crimesByNeigh=as.data.frame(crimesByNeigh)%>%rownames_to_column()
  • Merging the two data frames: Since both data frames have the same neighborhoods, we can make one data frame by merging them:
num_num=merge(daysByNeigh,crimesByNeigh) # 'row name' is the "key"
head(num_num)
##           rowname daysByNeigh crimesByNeigh
## 1 ALASKA JUNCTION    8.968437          1.35
## 2            ALKI   14.463768          0.49
## 3   BALLARD NORTH    7.788203          2.13
## 4   BALLARD SOUTH    5.262384          2.96
## 5        BELLTOWN    5.012865          2.99
## 6      BITTERLAKE    7.442241          1.92
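
The merge above relies on the shared column name. A minimal toy sketch (hypothetical values) showing that `merge()` matches rows by key rather than by position, with the key stated explicitly:

```r
# merge() joins on columns with the same name in both data frames by default;
# being explicit with 'by' is safer (toy data):
a <- data.frame(rowname = c("X", "Y"), v1 = c(1, 2))
b <- data.frame(rowname = c("Y", "X"), v2 = c(3, 4))
merge(a, b, by = "rowname") # rows are matched by key, regardless of order
```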

Once we have the data organized, the clear option is the scatterplot:

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh)) 
plot1= base +  geom_point() 
plot1
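
Since this section is about correlation, it may also help to quantify the association numerically; on the merged data this would be `cor(num_num$daysByNeigh, num_num$crimesByNeigh)`. A self-contained sketch with hypothetical values:

```r
# Pearson measures linear association; Spearman measures monotonic (rank) association.
x <- c(3.5, 5.0, 7.8, 9.0, 14.5)  # hypothetical mean days to report
y <- c(3.0, 2.9, 2.1, 1.4, 0.5)   # hypothetical percent of crimes
cor(x, y, method = "pearson")
cor(x, y, method = "spearman")    # -1: perfectly monotone decreasing
```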

We can improve the plot, this time introducing ggrepel:

library(ggrepel)
base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh,
                           label=rowname)) # you need this aesthetics!
plot1= base +  geom_point() 
plot1 + geom_text_repel()

We can limit the labels, annotating the ones that represent at least 5% of the crimes in the city:

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh,label=rowname)) 
plot1= base +  geom_point() 
plot1 + geom_text_repel(aes(label=ifelse(crimesByNeigh>=5,
                                         num_num$rowname, "")))

Notice the difference without ggrepel:

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh)) 
scatp1 = base +  geom_point() 
scatp1 + geom_text(aes(label=ifelse(crimesByNeigh>=5,num_num$rowname, "")))

The good thing is that ggrepel makes better use of the space; for example, when highlighting a single neighborhood:

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh,label=rowname)) 
base +  geom_point() + geom_text_repel(aes(label=ifelse(num_num$rowname=='NORTHGATE',
                                                        num_num$rowname, "")))

An alternative, to highlight overlapping points:

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh)) 
scatp1 = base +  geom_hex(bins = 10) # geom_hex() needs the 'hexbin' package installed
scatp2= scatp1 + geom_text_repel(aes(label=ifelse(crimesByNeigh>=5,
                                                  num_num$rowname,
                                                  ""))) 
scatp2 + scale_fill_distiller(palette ="Greys",direction=1) # try -1
## Warning: Computation failed in `stat_binhex()`:
## Package `hexbin` required for `stat_binhex`.
## Please install and try again.

The palettes can be selected from the ColorBrewer website. Using the same palette as before, we can try a different plot (stat_density_2d):

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh)) 
scatp1 = base +  stat_density_2d(aes(fill = ..density..), 
                                 geom = "raster", contour = FALSE)

scatp2=scatp1+geom_text_repel(aes(label=ifelse(crimesByNeigh>=5,
                                               num_num$rowname, "")))

scatp3 = scatp2 +  theme(legend.position='none', axis.title.y = element_text(size=10, vjust = 3), axis.title.x = element_text(size = 10, vjust = 0), title = element_text(size = 8)) + labs(x = "Average Days To Report", y = "Proportion of Crime", title = "Proportion of Crime and Average Days to Report Per Neighborhood", caption = "Source: City of Seattle Crime Data")

scatp4= scatp3 + scale_fill_distiller(palette="Greys", direction=1) 
scatp4

The extra space you see can disappear using:

scatp5 = scatp4 +  scale_x_continuous(expand = c(0, 0), breaks=seq(0,16,2)) + 
         scale_y_continuous(expand = c(0, 0), limits=c(0,5)) 
scatp5
## Warning: Removed 4 rows containing non-finite values (stat_density2d).
## Warning: Removed 4 rows containing missing values (geom_text_repel).

Exercise 4:
Complete the elements missing in the previous plots.


Go to table of contents.

Back to course schedule menu